Introduction

In this project we use a dataset of wine reviews to predict review points from numerical, categorical and textual predictors.

The data comes from Kaggle Datasets and covers about 150k wine reviews, along with some attributes of the wines. It can be found here (a free Kaggle login is required to access it directly from kaggle.com). The data was originally scraped from WineEnthusiast.

The dataset contains the following columns:

This is a particularly interesting problem for several reasons:

Methods

Feature Engineering

## [1] "points"           "price"            "continentTopFive"
## [4] "topic1"           "sentiment"
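
The report does not show how the `sentiment` column was derived from the review text. One common approach is a lexicon-based score: count the positive words and subtract the negative ones. The following is a minimal sketch under that assumption. The lexicon, the `score_sentiment` helper, and the example descriptions are all hypothetical and only illustrate the idea.

```r
# Hypothetical mini-lexicon; a real analysis would use a published one.
lexicon <- c(elegant = 1, rich = 1, delicious = 1,
             harsh = -1, thin = -1, bland = -1)

score_sentiment <- function(text) {
  # Lowercase, split on anything that is not a letter,
  # then sum the scores of the words found in the lexicon.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(lexicon[words], na.rm = TRUE)
}

# Made-up review descriptions, for illustration only.
descriptions <- c(
  "Rich and elegant, with a delicious finish.",
  "Thin and a bit harsh on the palate."
)

sentiment <- vapply(descriptions, score_sentiment, numeric(1),
                    USE.NAMES = FALSE)
sentiment
```

Words missing from the lexicon contribute nothing, so the score depends entirely on lexicon coverage.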

Training and Test Set Generation

## [1] TRUE
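
A random split into training and test sets can be sketched as follows. The data frame here is synthetic, the 80/20 ratio and the seed are assumptions, and only the name `wine_test` is taken from the report. The final line is the kind of sanity check that may have produced the `TRUE` shown above.

```r
# Synthetic stand-in for the engineered data; the column names follow the
# report, the values are made up for illustration.
set.seed(42)
n <- 1000
wine <- data.frame(
  points = round(rnorm(n, mean = 88, sd = 3)),
  price  = round(rlnorm(n, meanlog = 3, sdlog = 0.6), 2)
)

# Hold out 20% of the rows for testing (the ratio is an assumption).
train_idx  <- sample(nrow(wine), size = floor(0.8 * nrow(wine)))
wine_train <- wine[train_idx, ]
wine_test  <- wine[-train_idx, ]

# Sanity check: the two sets together account for every original row.
nrow(wine_train) + nrow(wine_test) == nrow(wine)
```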

Model Selection

Model 1

## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 2824.7, df = 20, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod1), 5000)
## W = 0.91738, p-value < 2.2e-16
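
The two diagnostics reported for each model can be reproduced along the following lines. This is a sketch on synthetic data, not the report's own code: the output headers above look like those of `lmtest::bptest()`, but to stay in base R the studentized (Koenker) Breusch-Pagan statistic is computed by hand here. The variables `x`, `y`, and `fit` are placeholders.

```r
set.seed(1)
n <- 20000
x <- runif(n, 1, 10)
# The error spread grows with x, so constant variance is violated on purpose.
y <- 80 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan test: regress the squared residuals on the
# predictors; n * R^2 is asymptotically chi-squared with
# df = number of predictors when the variance is constant.
e2      <- resid(fit)^2
aux     <- lm(e2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p_value = bp_p)

# shapiro.test() accepts at most 5000 values, hence the random subsample.
sw <- shapiro.test(sample(resid(fit), 5000))
sw
```

With samples this large, both tests can reject for small departures from the assumptions, so the W statistic itself (closer to 1 means closer to normal) is a useful companion to the p-value.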

Model 2

## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 5604, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod2), 5000)
## W = 0.99597, p-value = 1.765e-10

Model 3

## 
##  studentized Breusch-Pagan test
## 
## data:  mod3
## BP = 4521.8, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod3), 5000)
## W = 0.99349, p-value = 2.522e-14

Model 4

## 
##  studentized Breusch-Pagan test
## 
## data:  mod4
## BP = 4464.5, df = 21, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod4), 5000)
## W = 0.99624, p-value = 5.742e-10

Results

Model   CV RMSE
1       2.561
2       2.433
3       2.455
4       2.571

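# Predict test-set points with model 2, then compute the share of test reviews
# whose actual points fall within 2 * CV RMSE (2.433) of the prediction.
preds <- predict(mod2, newdata = wine_test)
success_ratio <- mean(abs(wine_test$points - preds) <= 2 * 2.433)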
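
The report does not state which form of cross-validation produced the CV RMSE column. For linear models, one common choice is leave-one-out cross-validation, which needs no refitting because the leave-one-out residual equals the ordinary residual divided by one minus the leverage. The sketch below assumes that choice. The helper `loocv_rmse` and the synthetic data are hypothetical.

```r
# LOOCV RMSE for a least-squares fit, without refitting the model:
# the leave-one-out residual is e_i / (1 - h_i), with h_i the leverage.
loocv_rmse <- function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

# Synthetic data for illustration only.
set.seed(7)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)

fit_small <- lm(y ~ x, data = d)
fit_big   <- lm(y ~ poly(x, 5), data = d)

# Lower is better; the extra polynomial terms are not guaranteed to help.
c(small = loocv_rmse(fit_small), big = loocv_rmse(fit_big))
```

Because the shortcut is exact only for least-squares fits, a model fitted another way would need explicit refitting on each fold.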

Discussion

Appendix

## # A tibble: 65,304 x 3
##    topic       term         beta
##    <int>      <chr>        <dbl>
##  1     1        100 4.664335e-04
##  2     2        100 7.077942e-04
##  3     1   20222030 1.119677e-06
##  4     2   20222030 6.400127e-07
##  5     1        age 2.741178e-03
##  6     2        age 4.352850e-03
##  7     1      ahead 8.223265e-05
##  8     2      ahead 3.273607e-05
##  9     1 background 3.161070e-04
## 10     2 background 1.449376e-04
## # ... with 65,294 more rows

## # A tibble: 270 x 4
##        term       topic1       topic2  log_ratio
##       <chr>        <dbl>        <dbl>      <dbl>
##  1   accent 0.0011319352 1.632377e-03  0.5281827
##  2     acid 0.0101020896 1.089604e-02  0.1091502
##  3      add 0.0015936485 6.447203e-04 -1.3055881
##  4      age 0.0027411779 4.352850e-03  0.6671645
##  5  alcohol 0.0007706188 2.302083e-03  1.5788504
##  6   almond 0.0015819072 2.658711e-04 -2.5728660
##  7   almost 0.0008838961 2.106729e-03  1.2530560
##  8    along 0.0013260939 1.152072e-03 -0.2029525
##  9 alongsid 0.0010304528 9.993428e-05 -3.3661550
## 10     also 0.0036764899 4.080261e-04 -3.1715957
## # ... with 260 more rows
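
The `log_ratio` column in the second table is consistent with log2(topic2 / topic1) applied to the per-topic word probabilities from the first table. The sketch below reproduces it in base R for two terms, using values copied from the tables above. The 0.001 filter is an assumption about how the 65,304 term rows were reduced to 270, and the names `beta_long` and `beta_wide` are placeholders.

```r
# Per-topic word probabilities in long form; values copied from the tables above.
beta_long <- data.frame(
  topic = c(1, 2, 1, 2),
  term  = c("age", "age", "accent", "accent"),
  beta  = c(2.741178e-03, 4.352850e-03, 1.1319352e-03, 1.632377e-03)
)

# One row per term, one column per topic.
beta_wide <- reshape(beta_long, idvar = "term", timevar = "topic",
                     direction = "wide")
names(beta_wide) <- c("term", "topic1", "topic2")

# Keep terms that are reasonably common in at least one topic
# (the 0.001 cutoff is an assumption, not taken from the report).
beta_wide <- beta_wide[beta_wide$topic1 > 0.001 | beta_wide$topic2 > 0.001, ]

# Positive values mean the term is more typical of topic 2.
beta_wide$log_ratio <- log2(beta_wide$topic2 / beta_wide$topic1)
beta_wide
```

A log ratio of 1 means the term is twice as probable under topic 2 as under topic 1, and -1 means the reverse.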